The MSSP NCNM Team

The MSSP NCNM Presentation - Professor: Haviland Wright

  • Group 1: Jimmy Ye, Jinyu Li, Yuli Jin

  • Group 2: Daniel Xu, Kayla Choi, Nancy Shen

  • Group 3: Mi Zhang, Boyu Chen, Shicong Wang, Biyao Zhang

  • Group 4: Keliang Xu, Yingjie Wang, James He, Ruining Jia

Our Partners

  • Alison Turner: A Community Development Planner at NCNMEDD and recent MSSP graduate

  • Aidan O’Hara: Working with Alison since late July

  • Allen Razdow: Founder and president of True Engineering Technology, LLC and originator of Truenumbers

Project Background

  • The current developing situation in NCNM:

Historically, few resources to acquire grants
Trouble successfully administering grants to complete projects
Currently, at a turning point:
New pandemic-related dollars flowing to the region; have capital to spend on new projects
Two big issues: broadband access and outmigration

  • What approaches are used for collecting data:

Census; they do not collect much data from their own office
They would like recommendations on the gaps or insufficiencies they are seeing in census data for the region.

  • What variables will we use for this project? On what scales are they measured:

Demographics (categorical).
Income (numerical), range: 0-1,000,000,000,000 (unsure if this is the maximum), gross receipts taxed.
Unemployment rate (numerical).
GDP (numerical).
Number of business establishments (numerical).

Project focus

The ED-900 form must accompany all EDA grant applications. Here’s an example:

Ultimate Goal:

  • A Truenumbers database that can be accessed by NCNMEDD and local government staff to assist with grant applications.

  • An analysis of the data from the region: census response rates are fairly low, which could lead to data quality issues.

  • If data quality issues exist, come up with supplemental sources of data to improve inferences made about the region.

Focus for this semester:

  • Truenumbers

  • Dive into what the census is, why it’s important, and how low response rates may pose an issue.

Our approach

  • Streamline the data acquisition, organization, and analysis process.

  • Using the Tnum package, create a function to extract county-level census data.

  • Visualize with ggplot to check the relationships between variables.

  • Build models to gain deeper insight into the grant situation in New Mexico.

Truenumbers

Data

Data Source

Our data is from the American Community Survey (ACS).

The ACS is a large demographic survey collected throughout the year using mailed questionnaires, telephone interviews, and visits from Census Bureau field representatives to about 3.5 million household addresses annually.

Data availability for geographic areas differs by population size:

1-year estimates are available for areas of population 65,000 or more, while 5-year estimates are available for all areas.
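
The availability rule above can be written as a small lookup. This is a minimal R sketch, not part of the project code: the function name acs_estimates_available is hypothetical, and the populations in the usage line are placeholders rather than actual county counts.

```r
# Hypothetical helper: which ACS estimates exist for an area of a given
# population, following the 65,000 threshold stated above.
acs_estimates_available <- function(population) {
  stopifnot(is.numeric(population), all(population >= 0))
  # 5-year estimates cover all areas; 1-year estimates need 65,000 or more.
  ifelse(population >= 65000, "1-year and 5-year", "5-year only")
}

# Placeholder populations, for illustration only.
acs_estimates_available(c(150000, 4000))
```

For small counties this means only the 5-year estimates can be used, so year-to-year comparisons across all eight counties would have to rely on the 5-year series.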

Data

Parameter interpretation

Estimates are produced for:
- demographic characteristics (sex, age);
- social characteristics (school enrollment, educational attainment);
- economic characteristics (employment status, commuting to work);
- housing characteristics (housing occupancy, units in structure).

In this presentation, we basically focused on

Data

What did we do?
We designed functions to clean the data so that everyone can easily use ‘filter’ and ‘select’ on it:

Modify dataset:
- clean_tag: simplifies the subject and tag columns
- get_county: simplifies the county column

Function for preparing the data for further analysis:
- get_county_data: gets the data for each county
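
As an illustration of the kind of cleaning described above, here is a minimal sketch of simplifying a county column. It is not the team's get_county function: the helper name simplify_county and the input format ("Santa Fe County, New Mexico", the style used for ACS geography names) are assumptions.

```r
# Hypothetical helper, not the project's get_county(): strip the
# " County, <State>" suffix from an ACS-style geography name.
simplify_county <- function(name) {
  # Remove everything from " County," to the end of the string.
  sub(" County,.*$", "", name)
}

simplify_county(c("Santa Fe County, New Mexico", "Sandoval County, New Mexico"))
```

Short county labels like these are easier to filter on and to use as plot labels than the full geography names.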

You still need a few steps to get the data:
Connect to the BU VPN: vpn.bu.edu
Run the following code and wait a few seconds:
- source(file = "data_clean/mexico_screen_function.R")
- data <- get_county_data()
- data %>% view()
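
Once the data frame is loaded, it is meant to be used with dplyr's filter and select. The sketch below uses a small made-up data frame in place of the real output of get_county_data(); the column names (county, year, variable, value) and all values are assumptions for illustration, not the actual schema.

```r
library(dplyr)

# Made-up stand-in for the result of get_county_data(); the real column
# names and values may differ.
data <- data.frame(
  county   = c("Santa Fe", "Santa Fe", "Sandoval", "Sandoval"),
  year     = c(2018, 2019, 2019, 2019),
  variable = c("per_capita_income", "per_capita_income",
               "per_capita_income", "unemployment_rate"),
  value    = c(1, 2, 3, 4)  # placeholder numbers, not real estimates
)

# Keep one variable for one year, then keep only the columns needed.
income_2019 <- data %>%
  filter(variable == "per_capita_income", year == 2019) %>%
  select(county, value)

income_2019
```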

EDA Appetizer

This image shows the overall population of New Mexico as well as the eight counties that we are interested in.

EDA Appetizer

This plot shows each of the eight counties' percentage of the observations in our data. Santa Fe and Sandoval are the two counties where the census collected the most data.
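
Percentages like these can be computed directly from the county column. The following is a minimal base R sketch; the county vector is made up for illustration and does not reflect the real observation counts.

```r
# Made-up county labels standing in for the county column of the data.
county <- c("Santa Fe", "Santa Fe", "Sandoval", "Santa Fe", "Sandoval", "Other")

# Count observations per county, then convert counts to percentages.
county_share <- round(100 * prop.table(table(county)), 1)

# Largest share first.
sort(county_share, decreasing = TRUE)
```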

EDA Appetizer

Comparing race across the eight counties shows that white is the dominant race in all of them. In addition, Sandoval and Santa Fe have larger populations than the rest.

EDA Appetizer

The figure shows the differences in per capita income between counties, as well as the year-to-year change in per capita income within each county.

Conclusion

What we have done so far:
Tnum part:
  1. Summarized and understood the Tnum functions;
  2. Became familiar with the Tnum cheatsheet and with making tree graphs.
Data part:
  1. Extracted 43,796 raw observations;
  2. Designed two functions to split the two text columns into seven columns;
  3. Extracted useful information such as time and county for the EDA team.
EDA:
  1. The majority of the census data was collected from Sandoval and Santa Fe;
  2. The dominant race in these eight counties is white;
  3. We also found that the gap between the poor and the rich is very large. Population and per capita income appear to be inversely related.
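
To test whether the relationship between population and per capita income is strictly inversely proportional (income = k / population for some constant k) rather than just negative, one option is to regress log(income) on log(population) and check that the slope is close to -1. The sketch below shows the method on synthetic data generated from an exact k / population rule; the names pop, income, and k are assumptions, and none of the values come from the project.

```r
# Synthetic data that is inversely proportional by construction
# (income = k / pop); these are not values from the project.
k      <- 1e9
pop    <- c(5000, 20000, 80000, 150000)
income <- k / pop

# If income = k / pop, then log(income) = log(k) - 1 * log(pop),
# so the fitted slope on log(pop) should be close to -1.
fit   <- lm(log(income) ~ log(pop))
slope <- unname(coef(fit)["log(pop)"])
slope
```

A slope well away from -1, or a poor fit, would suggest describing the relationship as negative rather than inversely proportional. With only eight counties, any such estimate would be rough.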

What we are going to do next:

  1. Develop a routine for mapping, text, and graphs so that we can easily re-create pictures in the future.
  2. Set a standard for future presentation slides.
  3. Move from the present basic analysis to more detailed geolocated maps and graphs, including bar charts and pie charts.
  4. Further analyze the data we have, and try to get more numeric values.
  5. Design a Tnum database and write a paper describing any findings on data quality, to help set a standard for grant applications.

Questions

Thank you